#load libraries
library(tidyverse) #for data science, loads other core libraries
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(kableExtra) #for table styling
Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':
group_rows
library(scales) #for scaling axes
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
library(reshape2) #for reshaping data
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
library(plotly) #for interactive plots
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
#load the dataset
diabetic_data <- read_csv("diabetic_data.csv", show_col_types = FALSE)
1. Problem Definition
Hospital readmissions are a critical healthcare problem due to their significant impact on patients’ health, healthcare costs, and the overall efficiency of the healthcare system. Readmissions, particularly those occurring within 30 days of discharge, are a key indicator of the quality of healthcare provided to patients with chronic conditions like diabetes. Predicting diabetic patient readmissions is significant for improving overall healthcare quality and implementing proper post-discharge support and intervention plans, ultimately improving patients’ long-term health and reducing unnecessary healthcare costs (Sharma et al. 2019).
The classification task this project aims to tackle is to predict diabetic patient readmission status within 30 days of discharge, using data collected from 130 United States (US) hospitals between 1999 and 2008. The dataset consists of various attributes on patient demographics, medical and hospitalization records, and treatment procedures, providing details on the factors contributing to patient readmission status (Strack et al. 2014). This is a multi-class classification task, where the target variable readmitted \(y\) has three distinct classes.
\[
y = \begin{cases}
<30 & \text{(patient was readmitted within 30 days)} \\
>30 & \text{(patient was readmitted after 30 days)} \\
\text{No} & \text{(patient was not readmitted)}
\end{cases}
\]
Developing a classification model to predict diabetic patient readmission status will contribute to improving healthcare quality and long-term patient health outcomes, avoiding unnecessary readmission costs, supporting clinical decision-making, and improving hospitals’ operational efficiency (V R Reji Raj and Mr. Rasheed Ahammed Azad. V 2023).
2. Data Description
The dataset used in this assignment, titled “Diabetes 130-US Hospitals for Years 1999-2008”, was obtained from the Health Facts database, a national warehouse that collects comprehensive clinical records from hospitals across the US (Strack et al. 2014). The raw dataset contains 101,766 records and 50 attributes, collected from 130 hospitals between 1999 and 2008. The dataset includes categorical features representing patient and hospital details, such as race, gender, medical specialty, diagnoses, and 23 medication-related features (e.g., metformin, insulin), as well as numerical attributes such as patient number, admission type, and number of lab procedures.
Data Dictionary
#create table for dataset features name and type
datatype_tbl <- tibble(
  Variable = names(diabetic_data),
  Type = sapply(diabetic_data, class)
)

#split into variable-type pairs
pairs <- datatype_tbl %>%
  mutate(Pair = map2(Variable, Type, ~c(.x, .y))) %>%
  pull(Pair) %>%
  unlist()

#set a table of 6 columns for better display
num_cols <- 6
rows_needed <- ceiling(length(pairs) / num_cols)
length(pairs) <- rows_needed * num_cols

#set column names and convert to matrix
col_names <- rep(c("Variable", "Type"), num_cols / 2)
pair_matrix <- matrix(pairs, ncol = num_cols, byrow = TRUE)

#display table using kable
kable(pair_matrix, col.names = col_names, escape = FALSE) %>%
  kable_styling(full_width = FALSE)
Variable                   Type       Variable                   Type       Variable                   Type
encounter_id               numeric    patient_nbr                numeric    race                       character
gender                     character  age                        character  weight                     character
admission_type_id          numeric    discharge_disposition_id   numeric    admission_source_id        numeric
time_in_hospital           numeric    payer_code                 character  medical_specialty          character
num_lab_procedures         numeric    num_procedures             numeric    num_medications            numeric
number_outpatient          numeric    number_emergency           numeric    number_inpatient           numeric
diag_1                     character  diag_2                     character  diag_3                     character
number_diagnoses           numeric    max_glu_serum              character  A1Cresult                  character
metformin                  character  repaglinide                character  nateglinide                character
chlorpropamide             character  glimepiride                character  acetohexamide              character
glipizide                  character  glyburide                  character  tolbutamide                character
pioglitazone               character  rosiglitazone              character  acarbose                   character
miglitol                   character  troglitazone               character  tolazamide                 character
examide                    character  citoglipton                character  insulin                    character
glyburide-metformin        character  glipizide-metformin        character  glimepiride-pioglitazone   character
metformin-rosiglitazone    character  metformin-pioglitazone     character  change                     character
diabetesMed                character  readmitted                 character
The Diabetes 130-US Hospitals for Years 1999-2008 dataset is very challenging, as it meets all the Dataset Selection Guidelines stated in the project requirements.
Large Size: The size of the dataset is large, as it contains over 100,000 records.
Messy: The dataset is considered messy, with more than 192,000 missing values identified across multiple features.
Complex: The dataset is considerably complex, as it contains a mix of categorical and numerical attributes representing complex relationships between multiple health factors.
Integration: The data was integrated from different data sources that included 130 hospitals.
Multi-class: The dataset involves a multi-class classification task based on the readmitted attribute (<30 if the patient was readmitted within 30 days, >30 if readmitted after 30 days, and No if the patient was not readmitted).
These aspects highlight the dataset’s complexity and suitability for evaluating real-world clinical scenarios using computational statistical models.
3. Clean and Prepare
Load, Understand, and Categorize Data
#get dimensions of the dataset
dim(diabetic_data)
[1] 101766 50
#display structure of the dataset
str(diabetic_data)

#summary statistics for each column
summary(diabetic_data)
encounter_id patient_nbr race gender
Min. : 12522 Min. : 135 Length:101766 Length:101766
1st Qu.: 84961194 1st Qu.: 23413221 Class :character Class :character
Median :152388987 Median : 45505143 Mode :character Mode :character
Mean :165201646 Mean : 54330401
3rd Qu.:230270888 3rd Qu.: 87545950
Max. :443867222 Max. :189502619
age weight admission_type_id
Length:101766 Length:101766 Min. :1.000
Class :character Class :character 1st Qu.:1.000
Mode :character Mode :character Median :1.000
Mean :2.024
3rd Qu.:3.000
Max. :8.000
discharge_disposition_id admission_source_id time_in_hospital
Min. : 1.000 Min. : 1.000 Min. : 1.000
1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 2.000
Median : 1.000 Median : 7.000 Median : 4.000
Mean : 3.716 Mean : 5.754 Mean : 4.396
3rd Qu.: 4.000 3rd Qu.: 7.000 3rd Qu.: 6.000
Max. :28.000 Max. :25.000 Max. :14.000
payer_code medical_specialty num_lab_procedures num_procedures
Length:101766 Length:101766 Min. : 1.0 Min. :0.00
Class :character Class :character 1st Qu.: 31.0 1st Qu.:0.00
Mode :character Mode :character Median : 44.0 Median :1.00
Mean : 43.1 Mean :1.34
3rd Qu.: 57.0 3rd Qu.:2.00
Max. :132.0 Max. :6.00
num_medications number_outpatient number_emergency number_inpatient
Min. : 1.00 Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
1st Qu.:10.00 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000
Median :15.00 Median : 0.0000 Median : 0.0000 Median : 0.0000
Mean :16.02 Mean : 0.3694 Mean : 0.1978 Mean : 0.6356
3rd Qu.:20.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 1.0000
Max. :81.00 Max. :42.0000 Max. :76.0000 Max. :21.0000
diag_1 diag_2 diag_3 number_diagnoses
Length:101766 Length:101766 Length:101766 Min. : 1.000
Class :character Class :character Class :character 1st Qu.: 6.000
Mode :character Mode :character Mode :character Median : 8.000
Mean : 7.423
3rd Qu.: 9.000
Max. :16.000
max_glu_serum A1Cresult metformin repaglinide
Length:101766 Length:101766 Length:101766 Length:101766
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
nateglinide chlorpropamide glimepiride acetohexamide
Length:101766 Length:101766 Length:101766 Length:101766
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
glipizide glyburide tolbutamide pioglitazone
Length:101766 Length:101766 Length:101766 Length:101766
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
rosiglitazone acarbose miglitol troglitazone
Length:101766 Length:101766 Length:101766 Length:101766
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
tolazamide examide citoglipton insulin
Length:101766 Length:101766 Length:101766 Length:101766
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
glyburide-metformin glipizide-metformin glimepiride-pioglitazone
Length:101766 Length:101766 Length:101766
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
metformin-rosiglitazone metformin-pioglitazone change
Length:101766 Length:101766 Length:101766
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
diabetesMed readmitted
Length:101766 Length:101766
Class :character Class :character
Mode :character Mode :character
The dataset named “diabetic_data.csv” was loaded using read_csv() with show_col_types = FALSE to suppress the message that appears when column data types are automatically inferred. The str() function was used to examine the structure of the dataset, whereas the summary() function provided key descriptive statistics such as minimum, median, mean, and maximum values. From the function outputs, it was observed that most of the diabetes dataset features fall into two main data types:
Numeric: These include hospitalization duration, lab test results, medications, and patient counts (e.g., time_in_hospital, num_procedures, num_medications).
Character: These include categorical attributes such as race, gender, age, diagnosis codes (diag_1, diag_2, diag_3), and medication-related features like insulin.
Identify and Handle Missing Values
Once a general understanding of the dataset was gained, it was examined for missing and null values. The dataset was first scanned for standard NA values using the is.na() function. This check returned zero missing values, indicating that missing data might have been recorded in alternative forms. Upon further inspection, it was found that missing values were represented in various non-standard forms such as “?” and “Unknown/Invalid”, which are often used to indicate missing or uncertain information. To systematically identify such entries, a loop was implemented to scan each column and count the occurrences of these values. After conducting this second-level inspection, eight features were identified with non-standard missing entries, primarily represented by question marks “?” and a few “Unknown/Invalid” values. The figure below displays the distribution of missing values; all of these were replaced with NA for uniformity.
#check for actual NA values in each column
na_counts <- colSums(is.na(diabetic_data))

#convert to a data frame
na_summary <- enframe(na_counts, name = "Column", value = "NA_Count")

#display table using kable
kable(na_summary, caption = "Standard Missing Values (NA) per Column") %>%
  kable_styling(full_width = FALSE, position = "center")
Standard Missing Values (NA) per Column
Every column returned an NA_Count of 0.
#create empty vectors to store results
column_names <- c()
question_marks <- c()
empty_strings <- c()
unknowns <- c()
unknown_invalids <- c()

#loop through each column
for (i in 1:ncol(diabetic_data)) {
  col_name <- colnames(diabetic_data)[i]
  col_data <- diabetic_data[[i]]

  #count each missing type
  qmark   <- sum(col_data == "?", na.rm = TRUE)
  empty   <- sum(col_data == "", na.rm = TRUE)
  unk     <- sum(col_data == "Unknown", na.rm = TRUE)
  unk_inv <- sum(col_data == "Unknown/Invalid", na.rm = TRUE)

  #only record if at least one missing-like value exists
  if (qmark + empty + unk + unk_inv > 0) {
    column_names     <- c(column_names, col_name)
    question_marks   <- c(question_marks, qmark)
    empty_strings    <- c(empty_strings, empty)
    unknowns         <- c(unknowns, unk)
    unknown_invalids <- c(unknown_invalids, unk_inv)
  }
}

#combine all into one data frame (after the loop)
missing_summary <- data.frame(
  Column          = column_names,
  Question_Mark   = question_marks,
  Empty_String    = empty_strings,
  Unknown         = unknowns,
  Unknown_Invalid = unknown_invalids
)
#display table using kable
kable(missing_summary, align = "lcccc", caption = "Summary of Missing Values") %>%
  kable_styling(full_width = FALSE, position = "center")
Summary of Missing Values
Column              Question_Mark  Empty_String  Unknown  Unknown_Invalid
race                         2273             0        0                0
gender                          0             0        0                3
weight                      98569             0        0                0
payer_code                  40256             0        0                0
medical_specialty           49949             0        0                0
diag_1                         21             0        0                0
diag_2                        358             0        0                0
diag_3                       1423             0        0                0
#prepare data for ggplot
missing_long <- missing_summary %>%
  pivot_longer(
    cols = c(Question_Mark, Empty_String, Unknown, Unknown_Invalid),
    names_to = "Type",
    values_to = "Count"
  )

#plot missing value count
p <- ggplot(missing_long, aes(x = reorder(Column, -Count), y = Count, fill = Type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Summary of Missing Values Counts",
    x = "Feature",
    y = "Missing Count",
    fill = "Missing Type"
  ) +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  theme_minimal(base_size = 10) +
  theme(
    axis.text.x = element_text(angle = 40, hjust = 1),
    plot.title = element_text(face = "bold", size = 10, hjust = 0.5)
  )

#make plot interactive
ggplotly(p)
#replace "?" with NA
diabetic_data[diabetic_data == "?"] <- NA

#replace "Unknown/Invalid" with NA
diabetic_data[diabetic_data == "Unknown/Invalid"] <- NA
The following features were removed from the dataset, as each has more than 40% missing values. Additionally, these features are not essential for building a reliable predictive model, nor do they impact the classification task of predicting diabetic patient readmission status within 30 days of discharge.
weight represents the patient’s body weight. More than 97% of weight values are missing. Due to its lack of completeness, it is not practical to include this feature.
payer_code refers to the patient’s method of payment or insurance with approximately 40% missing data. This feature is more administrative than predictive and has minimal relevance to the classification goal.
medical_specialty indicates the specialty of the admitting physician, with nearly 50% of missing values. Additionally, this feature contains a large number of inconsistent and sparse categories, which will not be beneficial during model building and training.
Since the dataset is already large and only a relatively small number of observations had missing values in important features like race, gender, and diagnosis codes, those observations were removed, reducing the dataset from 101,766 to 98,052 observations and keeping only complete records for analysis. Additionally, the features encounter_id (a unique ID for each hospital visit) and patient_nbr (a unique ID for each patient) were removed, since these two features do not provide any useful information for predicting diabetic patient readmission status.
#remove rows that contain any NA values
diabetic_data <- na.omit(diabetic_data)
dim(diabetic_data)
[1] 98052 47
#remove the encounter ID and patient number columns
diabetic_data <- diabetic_data %>%
  select(-c(encounter_id, patient_nbr))
dim(diabetic_data)
[1] 98052 45
Feature Engineering for Categorical Features
The number of unique values in each column was counted to identify features with high cardinality. Most medication-related features had only 2–4 unique values, making them straightforward to include in the model. However, the diagnosis features (diag_1, diag_2, and diag_3) contained hundreds of unique ICD-9 codes, the official system for assigning diagnosis codes in US hospitals (Strack et al. 2014). Using the diagnosis codes directly would therefore be complex and may lead to overfitting. As these codes are difficult to interpret on their own, they were mapped into broader disease groups, following guidelines from the reference documentation (Strack et al. 2014). This transformation preserves the medical meaning while reducing the number of categories. For example, codes between 390–459 and 785 were mapped to “Circulatory”, while codes from 460–519 and 786 were grouped under “Respiratory”. Some codes, such as “V45”, did not fall into any of the defined ICD-9 groupings and were assigned the group “Unknown” to preserve them for further review.
#count unique values in each column to identify high-cardinality features
for (col_name in colnames(diabetic_data)) {
  print(col_name)
  print(length(unique(diabetic_data[[col_name]])))
}
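The grouping code itself is not shown in this extract; the sketch below illustrates the mapping just described in base R. Only the cut-points named in the text (Circulatory, Respiratory, plus the Diabetes and Neoplasms ranges from Strack et al. 2014) are spelled out, the remaining ICD-9 ranges are collapsed into "Other", and the helper name map_icd9_group is illustrative.

```r
#hypothetical helper sketching the ICD-9 grouping described above;
#non-numeric codes ("V"/"E" prefixes, e.g. "V45") fall through to "Unknown"
map_icd9_group <- function(code) {
  num <- suppressWarnings(as.numeric(code))
  if (is.na(num)) return("Unknown")
  if ((num >= 390 && num <= 459) || num == 785) return("Circulatory")
  if ((num >= 460 && num <= 519) || num == 786) return("Respiratory")
  if (num >= 250 && num < 251) return("Diabetes")   #250.xx codes
  if (num >= 140 && num <= 239) return("Neoplasms")
  "Other" #remaining ICD-9 ranges collapsed for brevity
}

#applied per diagnosis column, e.g.:
#diabetic_data$diag_1_group <- sapply(diabetic_data$diag_1, map_icd9_group)
```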
The original diagnosis features diag_1, diag_2, and diag_3 were removed after the new diagnosis group features were engineered. This transformation helps reduce high cardinality and potential noise, making the data more suitable for building models. The intermediate columns used during the transformation (such as prefixes and numeric conversions) were also dropped to keep the dataset clean and focused on meaningful features. The diagnosis values that could not be mapped to ICD-9 categories and were labeled as “Unknown” lacked clinical meaning and were thus removed from the analysis. Removing these observations helps ensure cleaner inputs for modeling and reduces the risk of introducing noise or ambiguity during model training. As a result, the dataset was reduced from 98,052 to 56,673 observations.
#drop original diagnosis columns: diag_1, diag_2, diag_3
diabetic_data <- diabetic_data %>%
  select(-c(diag_1, diag_2, diag_3))

#drop helper columns used for processing
diabetic_data <- diabetic_data %>%
  select(-c(diag_1_prefix, diag_2_prefix, diag_3_prefix))
diabetic_data <- diabetic_data %>%
  select(-c(diag_1_num, diag_2_num, diag_3_num))
dim(diabetic_data)
[1] 98052 45
#recode unmapped "Unknown" diagnosis groups as NA, then drop those rows
diabetic_data$diag_1_group[diabetic_data$diag_1_group == "Unknown"] <- NA
diabetic_data$diag_2_group[diabetic_data$diag_2_group == "Unknown"] <- NA
diabetic_data$diag_3_group[diabetic_data$diag_3_group == "Unknown"] <- NA
diabetic_data <- na.omit(diabetic_data)

#check new dimensions
dim(diabetic_data)
[1] 56673 45
Based on the summary of unique values and a review of the dataset, two features were further dropped: discharge_disposition_id (26 unique values) and admission_source_id (15 unique values). Although these columns are stored as numeric values, they actually represent nominal categories (e.g., source of admission, discharge status). Keeping these features would require encoding them into many sub-features using one-hot encoding, which could increase dimensionality and make the model more complex, especially since we don’t have labels explaining what each ID value represents. More importantly, using them directly in models (especially linear models or distance-based models like KNN) can mislead the model, as it might interpret the numerical IDs as magnitudes. For example, it may treat ID 4 as twice as large as ID 2, which is not meaningful. Furthermore, all character-type features were converted to factors to ensure categorical variables are properly recognized and interpreted by statistical models.
#convert all character columns to factors
for (col in names(diabetic_data)) {
  if (class(diabetic_data[[col]]) == "character") {
    diabetic_data[[col]] <- as.factor(diabetic_data[[col]])
  }
}

#show which columns are now factors
str(diabetic_data)
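The numeric-ID problem described above can be shown with a tiny illustration (not part of the project code): treated as numbers, ID values acquire artificial distances that a distance-based model like KNN would exploit, whereas one-hot indicator columns keep every pair of distinct categories equally far apart.

```r
#three nominal IDs treated as numbers: 4 looks "twice as far" from 1 as 2 does
ids <- c(1, 2, 4)
d_numeric <- dist(ids)            #pairwise distances 1, 3, 2

#one-hot encoding via model.matrix: one indicator column per category
onehot <- model.matrix(~ factor(ids) - 1)
d_onehot <- dist(onehot)          #all pairwise distances equal sqrt(2)
```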
To further clean and prepare the dataset for modeling, outliers shall be identified and appropriately handled to reduce their impact on model training and performance and to improve the robustness of predictive results. The numerical features were examined for potential outliers using the Interquartile Range (\(IQR\)) method, which counts the number of outliers in each numeric feature and stores the results in a data frame for further analysis and visualization. The numerical features were first identified using the is.numeric() function. Each numeric feature was then processed through the \(IQR\) outlier detection method: calculating the first quartile \(Q_1\), the third quartile \(Q_3\), and the lower and upper bounds (defined as \(Q_1 - 1.5 \times IQR\) and \(Q_3 + 1.5 \times IQR\), respectively).
The boxplots below provide a comparative view of the numerical features with outliers identified by the IQR method. Features such as num_lab_procedures, num_medications, and the various visit counts (number_emergency, number_inpatient, number_outpatient) exhibit clearly right-skewed distributions with a large number of high-end outliers. These features may need special treatment before normalization or modeling to avoid bias in the results.
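The routine that produces the outlier_summary table is not shown in this extract; a minimal base-R sketch of the IQR rule just described follows. The data-dependent lines are left commented, since they assume the cleaned diabetic_data frame from the previous steps.

```r
#count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the rule described above
count_iqr_outliers <- function(values) {
  Q1 <- quantile(values, 0.25, na.rm = TRUE)
  Q3 <- quantile(values, 0.75, na.rm = TRUE)
  iqr <- Q3 - Q1
  sum(values < Q1 - 1.5 * iqr | values > Q3 + 1.5 * iqr, na.rm = TRUE)
}

#applied to every numeric feature, this reproduces the summary table, e.g.:
#numeric_cols    <- diabetic_data[sapply(diabetic_data, is.numeric)]
#outlier_summary <- data.frame(Feature       = names(numeric_cols),
#                              Outlier_Count = sapply(numeric_cols, count_iqr_outliers))
```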
#display table using kable
kable(outlier_summary, caption = "Numerical Features Outliers (Using IQR)") %>%
  kable_styling(full_width = FALSE, position = "center")
Numerical Features Outliers (Using IQR)
Feature              Outlier_Count
number_outpatient             9093
number_emergency              6090
number_inpatient              3908
num_medications               1631
time_in_hospital              1242
admission_type_id              204
num_lab_procedures             133
number_diagnoses                25
num_procedures                   0
#reshape to long format for ggplot
diabetic_data_long <- diabetic_data[, outlier_summary$Feature] %>%
  mutate(row_id = row_number()) %>%
  pivot_longer(cols = -row_id, names_to = "Feature", values_to = "Value")

#plot boxplot using ggplot
p <- ggplot(diabetic_data_long, aes(x = Feature, y = Value, color = Feature)) +
  geom_boxplot() +
  theme_minimal(base_size = 10) +
  theme(
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(angle = 40),
    plot.title = element_text(face = "bold", size = 10, hjust = 0.3),
    legend.position = "none"
  ) +
  scale_y_continuous(expand = expansion(mult = c(0.05, 0.1))) +
  labs(
    title = "Numeric Features with Outliers",
    x = "Feature",
    y = "Value"
  )

#make plot interactive
ggplotly(p)
To handle outliers, we iterated over all numeric variables and marked observations containing an outlier for easier identification. Since normalization (like Min-Max scaling) is highly sensitive to extreme values, the presence of outliers would have a severe impact: if we normalize before removing outliers, the min and max values will be skewed, and most of the data will be squeezed into a narrow range near 0, making it harder for the model to learn from the majority of the data. Thus, observations with outlier values were removed, resulting in a dataset with 39,304 observations and 43 features.
#create a logical index of rows to keep (initialize with TRUEs)
keep_rows <- rep(TRUE, nrow(diabetic_data))

#loop over numeric columns and mark rows with outliers
for (col in names(numeric_cols)) {
  values <- diabetic_data[[col]]
  Q1 <- quantile(values, 0.25)
  Q3 <- quantile(values, 0.75)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR

  #identify outliers for this feature
  outlier_mask <- values < lower_bound | values > upper_bound

  #mark FALSE for rows that are outliers in this column
  keep_rows <- keep_rows & !outlier_mask
}

#apply the filter once
diabetic_data <- diabetic_data[keep_rows, ]

#check new dimensions
dim(diabetic_data)
[1] 39304 43
Normalization
Min-Max normalization is applied to selected numeric features as they have different scales and large ranges, which could unfairly influence models that are sensitive to feature magnitude, such as KNN or logistic regression. Based on the output of the summary() function, it was observed that the following six features had relatively high maximum values or wide ranges: num_lab_procedures, num_medications, number_outpatient, number_emergency, number_inpatient, and number_diagnoses. These features were normalized using Min-Max scaling, which transforms their values to fall between 0 and 1. This ensures that no single feature dominates the learning process due to having larger numeric values. Features that had small ranges or represented categorical values (such as admission_type_id) were left unchanged, as normalization is not meaningful or necessary in those cases.
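The scaling step itself is not shown in this extract; as a minimal sketch, Min-Max scaling can be done with a simple helper (the application lines are commented, since they assume the cleaned diabetic_data; the helper name min_max is illustrative).

```r
#rescale a numeric vector to [0, 1]: (x - min) / (max - min)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

#the six wide-range features named above would then be rescaled in place, e.g.:
#scale_cols <- c("num_lab_procedures", "num_medications", "number_outpatient",
#                "number_emergency", "number_inpatient", "number_diagnoses")
#diabetic_data[scale_cols] <- lapply(diabetic_data[scale_cols], min_max)
```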
Exploratory Data Analysis (EDA) was conducted on the cleaned dataset to gain comprehensive insights into the data, uncover patterns, and identify modeling strategies. The analysis focuses on the target variable readmitted and the dataset features potentially associated with readmission status, contributing to the classification goal of predicting diabetic patient readmission status within 30 days of discharge.
Target Variable Distribution
The bar chart below visualizes the distribution of the target variable readmitted. It can be noted that more than 50% of patients were not readmitted, around 30% were readmitted after 30 days, and only about 10% were readmitted within 30 days. The bar chart reveals a significant class imbalance, with the high-risk class of patients readmitted within 30 days (<30) being the minority class. This imbalance highlights the need for class-weight adjustment during modeling to improve classifier sensitivity.
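One common way to implement the class-weight adjustment mentioned above is inverse-frequency weighting; the sketch below is illustrative only (class_weights is a hypothetical helper, not the project's chosen method).

```r
#inverse-frequency weights: weight_c = n / (k * n_c), so the rarest class
#(here "<30") receives the largest weight during model training
class_weights <- function(labels) {
  freq <- table(labels)
  sum(freq) / (length(freq) * freq)
}

#example with an imbalanced label vector mirroring the three readmission classes
w <- class_weights(c(rep("No", 6), rep(">30", 3), "<30"))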
#summary table with three columns, renamed as follows
readmit_dist <- diabetic_data %>%
  dplyr::count(readmitted, name = "Total_Patients") %>%
  dplyr::mutate(
    Proportion = Total_Patients / sum(Total_Patients),
    Percentage = percent(Proportion, accuracy = 1)
  ) %>%
  rename(`Readmission Category` = readmitted) %>%
  select(`Readmission Category`, Total_Patients, Percentage)
#plot readmission class distribution
p <- ggplot(readmit_dist, aes(x = reorder(`Readmission Category`, -Total_Patients),
                              y = Total_Patients, fill = `Readmission Category`)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(
    aes(label = paste0(Percentage, "\n(", comma(Total_Patients), ")")),
    vjust = -0.3,
    size = 2,
    fontface = "bold"
  ) +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.1))) +
  labs(
    title = "Distribution of Readmission Status",
    x = "Readmission Category",
    y = "Number of Patients"
  ) +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold", size = 10, hjust = 0.5))

#make plot interactive
ggplotly(p)
Patients Demographic Composition
Patient demographic features such as gender, age, and race were analyzed to uncover readmission patterns and understand where the bulk of readmissions comes from. The figure below highlights that readmission status is relatively balanced between males and females, with both genders showing similar readmission patterns, indicating no major impact. Looking at the age groups, the figure highlights that most patients are between 50 and 80 years old, where the highest readmission counts can be noted, indicating that age is a possible factor impacting readmission status. The higher counts among Caucasian patients reflect population volume rather than higher risk of readmission, as Caucasian patients make up the largest racial group within the dataset; therefore, a proportion analysis was also conducted to provide a more accurate comparison.
#gender plot
p1 <- ggplot(diabetic_data, aes(x = gender, fill = readmitted)) +
  geom_bar(position = "dodge") +
  labs(x = "Gender", fill = "Readmitted") +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  theme_minimal(base_size = 10) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.2),
    legend.position = "right"
  )

#age plot
p2 <- ggplot(diabetic_data, aes(x = age, fill = readmitted)) +
  geom_bar(position = "dodge") +
  labs(x = "Age", fill = "Readmitted") +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  theme_minimal(base_size = 10) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.2),
    axis.text.x = element_text(angle = 40, hjust = 0.2)
  )

#race plot
p3 <- ggplot(diabetic_data, aes(x = race, fill = readmitted)) +
  geom_bar(position = "dodge") +
  labs(x = "Race", fill = "Readmitted") +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  theme_minimal(base_size = 10) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.2),
    axis.text.x = element_text(angle = 40, hjust = 0.2)
  )

#combine plots horizontally and make plot interactive
subplot_combined <- subplot(
  ggplotly(p1),
  ggplotly(p2) %>% layout(showlegend = FALSE),
  ggplotly(p3) %>% layout(showlegend = FALSE),
  nrows = 1,
  shareX = FALSE,
  shareY = FALSE,
  titleX = TRUE,
  titleY = FALSE
) %>%
  layout(
    title = list(
      text = "Patient Readmissions by Gender, Age, and Race",
      x = 0.5,
      xanchor = "center",
      font = list(size = 16, family = "Arial", color = "#333333")
    ),
    annotations = list(
      list(
        text = "Number of Patients",
        x = 0, xref = "paper",
        y = 0.5, yref = "paper",
        showarrow = FALSE,
        font = list(size = 12),
        textangle = -90
      )
    )
  )

#display the final result
subplot_combined
Patient Profile Comparison
The radar chart below provides a comparative overview of patient profiles across the three readmission classes. Patients who were readmitted within 30 days exhibit higher averages across most variables, including number of medications, length of hospital stay, emergency room visits, and inpatient encounters, suggesting a more complicated medical profile. In contrast, patients who were not readmitted generally demonstrate lower values across most variables, with the slight exception of the number of procedures, which may reflect more planned preventive care or early interventions rather than medical complications.
#get the average profile for each readmission group
radar_data <- diabetic_data %>%
  group_by(readmitted) %>%
  summarise(
    Medications      = mean(num_medications, na.rm = TRUE),
    Procedures       = mean(num_procedures, na.rm = TRUE),
    TimeInHospital   = mean(time_in_hospital, na.rm = TRUE),
    ERVisits         = mean(number_emergency, na.rm = TRUE),
    InpatientVisits  = mean(number_inpatient, na.rm = TRUE),
    OutpatientVisits = mean(number_outpatient, na.rm = TRUE),
    .groups = "drop"
  )

#fmsb needs the first two rows to define the range (max + min) of the axes
radar_chart <- rbind(
  apply(radar_data[, -1], 2, max),
  apply(radar_data[, -1], 2, min),
  radar_data[, -1]
)

#convert to numeric
radar_chart <- as.data.frame(lapply(radar_chart, as.numeric))
rownames(radar_chart) <- c("Max", "Min", radar_data$readmitted)

#set custom color-blind friendly colors
custom_colors <- c("#21908C", "#440154", "#5DC863")
colors_fill <- scales::alpha(custom_colors, 0.3)

#plot radar chart (radarchart() is provided by the fmsb package)
radarchart(
  radar_chart,
  axistype = 1,
  pcol = custom_colors,
  pfcol = colors_fill,
  plwd = 2,
  plty = 1,
  cglcol = "grey",
  cglty = 1,
  axislabcol = "grey30",
  vlcex = 0.85,
  title = "Patient Profile Comparison by Readmission Status"
)

#add a legend to keep it readable
legend(
  "topright",
  legend = radar_data$readmitted,
  bty = "n", pch = 20, col = custom_colors,
  text.col = "black", cex = 0.9
)
Diagnosis-Based Readmission Patterns
The feature engineering carried out on the diagnosis code features in the clean and prepare phase enabled an interpretable analysis of how diagnosis categories relate to patient readmission status. The chart shows the distribution of diagnosis categories across the three diagnosis levels (diag_1, diag_2, and diag_3), grouped by readmission status. Circulatory and Other conditions are the most frequent across all diagnosis levels, especially in the primary diagnosis (diag_1). In contrast, conditions like Diabetes and Neoplasms are more frequently recorded as secondary or tertiary issues, suggesting their significant impact on patient readmission status. Overall, this visualization provides insight into the underlying medical conditions that are possibly associated with hospital readmission, uncovering readmission patterns.
Code
#combine diagnosis group variables for plotting
diag_long <- diabetic_data %>%
  select(readmitted, diag_1_group, diag_2_group, diag_3_group) %>%
  pivot_longer(cols = starts_with("diag_"),
               names_to = "Diagnosis_Level",
               values_to = "Diagnosis_Group")

#clean label names
diag_long$Diagnosis_Level <- recode(diag_long$Diagnosis_Level,
                                    diag_1_group = "Diagnosis 1",
                                    diag_2_group = "Diagnosis 2",
                                    diag_3_group = "Diagnosis 3")

#plot bar charts
p <- ggplot(diag_long, aes(x = fct_infreq(Diagnosis_Group), fill = readmitted)) +
  geom_bar(position = "dodge") +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  labs(title = "Readmission Count by Diagnosis Level (diag_1, diag_2, diag_3)",
       x = "Diagnosis Group",
       y = "Number of Patients",
       fill = "Readmitted") +
  facet_wrap(~ Diagnosis_Level, ncol = 1, scales = "free_x") +
  theme_minimal(base_size = 10) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1, face = "bold"),
        strip.text = element_text(face = "bold"),
        legend.position = "right")

ggplotly(p, height = 600)
Correlation Between Numeric Variables
The heatmap below presents the correlation matrix for the numeric features in the dataset, offering a snapshot of how these features relate to one another. Overall, most numeric features show only low to moderate correlation, indicating that each feature provides a different type of information rather than redundant signal. Key observations include:
A moderate positive correlation between time_in_hospital and both num_lab_procedures (0.33) and num_medications (0.43), indicating that longer hospital stays are associated with more lab procedures and medications.
A low positive correlation between most other feature pairs, such as number_inpatient and num_procedures, indicating their independent value.
The correlation matrix supported the inclusion of the dataset's numeric features in classifier modeling, as they appear to contribute unique information toward the underlying classification goal of predicting diabetic patient readmission status within 30 days of discharge.
Code
#identify numeric columns
numeric_vars <- diabetic_data[sapply(diabetic_data, is.numeric)]
numeric_vars <- numeric_vars[, colSums(!is.na(numeric_vars)) > 0]

#prepare correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")
cor_df <- melt(cor_matrix)

#base heatmap with better visual harmony
p <- ggplot(cor_df, aes(x = Var2, y = Var1, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)), color = "black", size = 3.5) +
  scale_fill_gradient2(
    low = "#440154",   # purple for negative
    mid = "white",     # neutral
    high = "#21908C",  # teal for positive
    midpoint = 0,
    limits = c(-1, 1),
    name = "Correlation"
  ) +
  labs(title = "Correlation Between Patient Numeric Features",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 10) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold"),
        axis.text.y = element_text(face = "bold"),
        legend.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", hjust = 0.5))

ggplotly(p)
EDA Insights and Summary
The EDA provided valuable insights into the factors most likely to influence hospital readmission among diabetic patients. While some findings confirmed our initial expectations of the underlying dataset, others revealed less expected patterns.
The target variable readmitted revealed that the majority of patients were either not readmitted or were readmitted after 30 days. A smaller, yet medically significant, subset of patients were readmitted within 30 days, confirming a class imbalance that will be accounted for during the modeling stage.
Demographics such as gender, race, and age showed some variation. Older age groups, particularly 60–80, dominate readmissions, which is consistent with chronic conditions such as diabetes.
The diagnosis groups helped considerably in narrowing down the most impactful features. Circulatory and Respiratory diagnoses appear more frequently and consequently show higher readmission counts. On the other hand, conditions such as Neoplasms (cancer) surprisingly showed low readmission, likely because follow-ups are handled in outpatient settings or specialized clinics.
The correlation heatmap confirmed low correlation among the numerical features, indicating low redundancy. This is ideal for building a classification model, as each feature contributes different information.
These findings established a solid foundation to address the evaluation plan and model selection.
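The class-imbalance finding above comes down to a one-line tabulation of the target variable. As a minimal sketch, the snippet below uses a small toy vector standing in for the readmitted column (the counts are hypothetical, chosen only to mimic a roughly 50% "NO" majority):

```r
# toy stand-in for diabetic_data$readmitted (hypothetical counts)
readmitted <- c(rep("NO", 50), rep(">30", 35), rep("<30", 15))

# absolute counts and class proportions
class_counts <- table(readmitted)
class_props  <- prop.table(class_counts)
class_props  # "NO" dominates; "<30" is the minority class
```

On the real dataset the same two lines applied to diabetic_data$readmitted produce the proportions discussed in the evaluation plan.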
5. Evaluation Plan
As the exploratory data analysis of the target variable revealed a clear class imbalance, with the “NO” readmission class accounting for approximately 50% of the entire dataset, evaluation metrics that go beyond accuracy alone are crucial for properly evaluating model performance. To provide a more comprehensive assessment of model performance across all classes, below are the selected evaluation metrics:
Accuracy: This metric measures the overall correctness of predictions across all samples, i.e., the proportion of correct predictions. Accuracy is selected as a baseline comparison measure since it is a quick indicator of overall model performance, but it can be misleading when classes are imbalanced.
Precision: This metric measures the proportion of predicted positives that were actually correct, computed for each class (Precision_NO, Precision<30, Precision>30); it evaluates how reliable the model's predictions are when it predicts a particular class. Precision matters when false positives are costly, and it is selected here because misclassifying a low-risk patient as high risk is costly for this classification task.
Recall (Sensitivity): This metric measures the model's ability to detect positive cases, computed for each class (Recall_NO, Recall<30, Recall>30); it assesses how well the model identifies actual instances of each class. It is chosen because it helps assess the model's capability to distinguish between different readmission timeframes, which is crucial for clinical decision-making and targeted interventions.
F1 Score: This metric is the harmonic mean of precision and recall, capturing the balance between the two; computing it per class provides a balanced evaluation of each class. In imbalanced datasets, F1 is more informative than accuracy, as it balances the trade-off between precision and recall and gives a more realistic measure of a model's performance on the minority class.
Macro F1 Score: This metric is the arithmetic mean of the F1 scores across all three classes, giving equal weight to each class and offering a fair assessment of overall performance. Because it treats all classes equally, the Macro F1 Score is a proper measure for an imbalanced dataset and indicates whether the model is biased toward the majority class.
Given that this task represents a typical multi-class classification problem with imbalanced class distribution (e.g., NO cases are significantly more common than <30 cases), relying solely on Accuracy is insufficient for a comprehensive evaluation. Therefore, we emphasize per-class Precision, Recall, and F1 scores, which are particularly important for assessing how well the model performs on the high-risk class of early readmission (<30). Additionally, the use of Macro F1 Score helps mitigate the dominance of the majority class and provides a more representative evaluation of model performance.
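To make these definitions concrete, the sketch below derives per-class precision, recall, and F1, plus accuracy and macro F1, from a 3x3 confusion matrix. The counts are purely illustrative, not results from any fitted model:

```r
# hypothetical confusion matrix: rows = actual class, columns = predicted class
cm <- matrix(c(400,  60,  40,
                80, 250,  70,
                30,  50, 120),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual    = c("NO", ">30", "<30"),
                             predicted = c("NO", ">30", "<30")))

precision <- diag(cm) / colSums(cm)   # TP / (TP + FP), per predicted class
recall    <- diag(cm) / rowSums(cm)   # TP / (TP + FN), per actual class
f1        <- 2 * precision * recall / (precision + recall)

macro_f1  <- mean(f1)                 # unweighted mean across the three classes
accuracy  <- sum(diag(cm)) / sum(cm)  # overall proportion of correct predictions

round(c(accuracy = accuracy, macro_f1 = macro_f1), 3)  # accuracy 0.700, macro_f1 0.669
```

Note how macro F1 (0.669) sits below accuracy (0.700) here: the weaker minority-class F1 pulls the macro average down, which is exactly the bias signal the metric is selected to expose.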
6. Classification Models
Five classification models are selected for fitting and evaluation, each suited to the characteristics and classification goals of the "Diabetes 130-US Hospitals for Years 1999-2008" dataset. These classification models are:
Decision Tree: Decision trees are well-suited to our diabetes dataset with mixed-type variables and do not require standardization. The model offers high interpretability by allowing traceability of decision paths. It also supports class weighting, which can help improve detection of underrepresented classes such as <30. Evaluation will focus on class-wise recall and macro-averaged F1 score to assess balanced performance.
Random Forest: Random Forest aggregates multiple decision trees to reduce overfitting and better capture complex feature interactions (e.g., number_inpatient, num_lab_procedures). It is robust to feature transformations, handles class imbalance effectively, and provides variable importance rankings. Model evaluation includes macro F1 and precision-recall to capture both overall and minority-class performance.
Support Vector Machine (SVM): SVM is suitable for our high-dimensional dataset and is capable of learning non-linear decision boundaries via kernel functions. With class weighting enabled (e.g., class_weight='balanced'), SVM can better manage imbalance, particularly for the <30 class. Performance will be evaluated using precision, recall, F1, macro F1, and AUC, with emphasis on the minority class.
k-Nearest Neighbors (k-NN): k-NN is a distance-based, non-parametric method that relies on similarity across the feature space. As it is sensitive to feature scaling, standardization was applied. While k-NN serves as a strong baseline, it may favor majority classes; therefore, it is used primarily for benchmarking, with evaluation based on per-class precision, recall, and F1, macro F1, weighted F1 score, and MCC.
Logistic Regression: Logistic regression is appropriate for modeling multi-class classification tasks with interpretable, probability-based outputs. It also supports class weighting, which is helpful for addressing class imbalance. Evaluation metrics include accuracy, sensitivity, specificity, F1 score, macro F1, and weighted F1 score in both binary and multi-class settings.
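The class-weighting idea mentioned for several of these models can be sketched in a few lines. The example below fits a weighted logistic regression on simulated toy data; the variable names and values are illustrative only, not the project's fitted model, and the quasibinomial family is used so that non-integer case weights do not trigger a warning from glm:

```r
set.seed(42)
n <- 500

# toy data standing in for the cleaned dataset (hypothetical values)
toy <- data.frame(
  time_in_hospital = rpois(n, 4),
  num_medications  = rpois(n, 15),
  readmitted_30    = rbinom(n, 1, 0.12)  # 1 = readmitted within 30 days
)

# inverse-frequency case weights up-weight the minority class
p_minority <- mean(toy$readmitted_30)
w <- ifelse(toy$readmitted_30 == 1, 1 / p_minority, 1 / (1 - p_minority))

# weighted logistic regression (binary setting: <30 vs the rest)
fit <- glm(readmitted_30 ~ time_in_hospital + num_medications,
           data = toy, family = quasibinomial, weights = w)
coef(fit)  # intercept plus one slope per predictor
```

The same inverse-frequency weighting carries over to the multi-class setting and to the tree-based and SVM models, where it enters through their respective class-weight arguments.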
7. Innovation
This project team made a deliberate effort to go beyond the project's requirements and to demonstrate innovation across the project plan and EDA. These innovations include: the selection of a challenging and novel dataset that meets all five dataset selection guidelines; inclusive and accessible visuals (all plots use color-blind-friendly palettes and were made interactive with ggplotly to support dynamic exploration); domain-informed feature engineering (transformation of the diagnosis features using an ICD-9 code map); and the use of the kableExtra library to display table results intuitively.
8. References
Sharma, Abhishek, Prateek Agrawal, Vishu Madaan, and Shubham Goyal. 2019. "Prediction on Diabetes Patient's Hospital Readmission Rates." Proceedings of the Third International Conference on Advanced Informatics for Computing Research, June, 1–5. https://doi.org/10.1145/3339311.3339349.
Strack, Beata, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore. 2014. "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records." BioMed Research International 2014: 1–11. https://doi.org/10.1155/2014/781670.
Reji Raj, V R, and Rasheed Ahammed Azad V. 2023. "Predictive Analysis of Heterogenous Data for Hospital Readmission." International Journal of Scientific Research in Science, Engineering and Technology, January, 106–12. https://doi.org/10.32628/ijsrset231012.
Source Code
---title: "STAT5003: Project Plan and EDA"subtitle: "Workshop 11 Group 01"date: "13 April 2025"author: - name: "jche0758" affiliation: "SID: 520110054" - name: "lals0119" affiliation: "SID: 540615841" - name: "nals0930" affiliation: "SID: 540927401" - name: "nkha0389" affiliation: "SID: 540829493" - name: "yaff0377" affiliation: "SID: 530784645"format: html: code-fold: true code-tools: true self-contained: trueeditor: visualtoc: truebibliography: references.bib---::: {style="text-align: justify; font-size: 10pt;"}<details><summary>Code</summary>```{r}#load librarieslibrary(tidyverse) #for data science, loads other core librarieslibrary(kableExtra) #for table stylinglibrary(scales) #for scaling axeslibrary(reshape2) #for reshaping datalibrary(fmsb) #for radar/spider chartslibrary(patchwork) #for combining ggplotslibrary(plotly) #for interactive ggplots``````{r}#load the datasetdiabetic_data <-read_csv("diabetic_data.csv", show_col_types =FALSE)```</details>:::### 1. Problem Definition::: {style="text-align: justify; font-size: 10pt;"}Hospital readmissions are a critical healthcare problem due to the significant impact they have on patient's health, healthcare costs, and the overall efficiency of the healthcare system. Hospital readmissions, primarily those occurring within 30 days of discharge, are a key indicator of the healthcare quality provided to patients with chronic conditions like diabetes. Predicting diabetic patient readmissions is significant for improving the overall healthcare quality and implementing proper post-discharge support and intervention plans, ultimately improving patient's long-term health and reducing unnecessary healthcare costs [@sharma2019].The classification task this project aims to tackle is **to predict diabetic patient readmission status within 30 days of discharge**, using data collected from 130 United States (US) hospitals between 1999 and 2008. 
The dataset consists of various attributes on patient demographics, medical and hospitalization records, and treatment procedures, providing details on the factors contributing to patient readmission status [@strack2014a]. This is a multi-class classification task, where the target variable readmitted $y$ has three distinct classes.$$y = \begin{cases} <30 & \text{(patient was readmitted within 30 days)} \\>30 & \text{(patient was readmitted after 30 days)} \\\text{No} & \text{(patient was not readmitted)} \end{cases}$$Developing a classification model to predict diabetic patient readmission status will contribute to improving healthcare quality, improving patient health and long-term health outcomes, avoiding unnecessary readmission costs, supporting clinical decision-making, and improving hospital's operational efficiency [@vrrejiraj2023].:::### 2. Data Description::: {style="text-align: justify; font-size: 10pt;"}The dataset used in this assignment titled "Diabetes 130-US Hospitals for Years 1999-2008" was obtained from the Health Facts database, a national warehouse that collects comprehensive clinical records from hospitals across the US [@strack2014a]. The raw dataset contains **101,767 records** and **50 attributes**, collected from **130 hospitals** between 1999 and 2008. 
The dataset includes **categorical features** representing patient and hospital detailes, such as race, gender, medical specialty, diagnoses, **23 medication-related features** (e.g., metformin, insulin, etc.), and **numerical attributes** such as patient number, admission type and number of lab procedures.:::::: {style="text-align: justify; font-size: 10pt;"}<details><summary>Data Dictionary</summary>```{r}#create table for dataset features name and typedatatype_tbl <-tibble(Variable =names(diabetic_data),Type =sapply(diabetic_data, class))#split into variable-type pairspairs <- datatype_tbl %>%mutate(Pair =map2(Variable, Type, ~c(.x, .y))) %>%pull(Pair) %>%unlist()#set a table of 6 columns for better displaynum_cols <-6rows_needed <-ceiling(length(pairs) / num_cols)length(pairs) <- rows_needed * num_cols #set column names and conver to matrixcol_names <-rep(c("Variable", "Type"), num_cols /2)pair_matrix <-matrix(pairs, ncol = num_cols, byrow =TRUE)#display table using kablekable(pair_matrix, col.names = col_names, escape =FALSE) %>%kable_styling(full_width =FALSE)```</details>:::::: {style="text-align: justify; font-size: 10pt;"}The Diabetes 130-US Hospitals for Years 1999-2008 dataset is very challenging, as it meets **all** the *Dataset Selection Guidelines* stated in the project requirements.- **Large Size**: The size of the dataset is large, as it contains over **100,000 records**.- **Messy**: The dataset is considered messy, with more than **192,000 missing values** identified across multiple features.- **Complex**: The dataset is considerably complex, as it contains a **mix of categorical and numerical attributes** representing complex relationships between multiple health factors.- **Integration**: The data was integrated from different data sources that included **130 hospitals**.- **Multi-class**: The dataset involves a **multi-class classification task** based on the readmitted attribute (\<30 if the patient was readmitted within 30 days, \>30 if 
readmitted after 30 days, and No if the patient was not readmitted).These aspects highlight the dataset's complexity and suitability for evaluating real-world clinical scenarios using computational statistical models.:::### 3. Clean and Prepare##### **Load, Understand, and Categorize Data**::: {style="text-align: justify; font-size: 10pt;"}<details><summary>Code</summary>```{r}#get dimensions of the datasetdim(diabetic_data)#display structure of the datasetstr(diabetic_data)#dummary statistics for each columnsummary(diabetic_data)```</details>:::::: {style="text-align: justify; font-size: 10pt;"}The dataset named "diabetic_data.csv" was loaded using `read_csv()` with `show_col_types = FALSE` to suppress the warning message that appears when column data types are automatically inferred. `str()` function was used to examine the structure of the dataset, whereas `summary()` function was used to provide key descriptive statistics such as minimum, median, mean, and maximum values. From the functions output it was observed that most of the diabetes dataset features fall into two main data types:- **Numeric:** These include hospitalization duration, lab test results, medications, and patient counts (e.g., time_in_hospital, num_procedures, num_medications).- **Character:** These include categorical attributes such as race, gender, age, diagnosis codes (diag_1, diag_2, diag_3), and medication-related features like insulin.:::##### **Identify and Handle Missing Values**::: {style="text-align: justify; font-size: 10pt;"}Once a general understanding was gained on the dataset, the dataset was examined for missing and null values. The dataset was first scanned for standard NA valus using `is.na()` function. This check returned zero missing values, indicating that missing data might have been recorded in alternative forms. Upon further inspection of the dataset, it was found that missing values were represented in various non-standard forms such as "?" and "Unknown/Invalid". 
These are often used to indicate missing or uncertain information. To systematically identify such entries, a loop was implemented to scan each column and count the occurrences of these values. After conducting a second-level inspection for missing values, eight features were identified with non-standard missing entries. These were primarily represented by question marks "?" and a few "Unknown/Invalid" values. The figure below displays the missing values distribution, all these missing values were replaced with NA for uniformity.:::::: {style="text-align: justify; font-size: 10pt;"}<details><summary>Code</summary>```{r}#check for actual NA values in each columnna_counts <-colSums(is.na(diabetic_data))#convert to a data framena_summary <-enframe(na_counts, name ="Column", value ="NA_Count")#display table using kablekable(na_summary, caption ="Standard Missing Values (NA) per Column") %>%kable_styling(full_width =FALSE, position ="center")``````{r}#create empty vectors to store resultscolumn_names <-c()question_marks <-c()empty_strings <-c()unknowns <-c()unknown_invalids <-c()#loop through each columnfor (i in1:ncol(diabetic_data)) { col_name <-colnames(diabetic_data)[i] col_data <- diabetic_data[[i]] #count each missing type qmark <-sum(col_data =="?", na.rm =TRUE) empty <-sum(col_data =="", na.rm =TRUE) unk <-sum(col_data =="Unknown", na.rm =TRUE) unk_inv <-sum(col_data =="Unknown/Invalid", na.rm =TRUE)#only record if at least one missing-like value existsif (qmark + empty + unk + unk_inv >0) { column_names <-c(column_names, col_name) question_marks <-c(question_marks, qmark) empty_strings <-c(empty_strings, empty) unknowns <-c(unknowns, unk) unknown_invalids <-c(unknown_invalids, unk_inv) }}#combine all into one data frame (after the loop)missing_summary <-data.frame(Column = column_names,Question_Mark = question_marks,Empty_String = empty_strings,Unknown = unknowns,Unknown_Invalid = unknown_invalids)``````{r}#display table using kablekable(missing_summary, align 
="lcccc", caption ="Summary of Missing Values") %>%kable_styling(full_width =FALSE, position ="center")```</details>```{r}#prepare data for ggplotmissing_long <- missing_summary %>%pivot_longer(cols =c(Question_Mark, Empty_String, Unknown, Unknown_Invalid),names_to ="Type",values_to ="Count" )#plot missing value countp <-ggplot(missing_long, aes(x =reorder(Column, -Count), y = Count, fill = Type)) +geom_bar(stat ="identity", position ="dodge") +labs(title ="Summary of Missing Values Counts",x ="Feature",y ="Missing Count",fill ="Missing Type" ) +scale_fill_viridis_d(option ="D", begin =0.1, end =0.9) +#color-blind friendlytheme_minimal(base_size =10) +theme(axis.text.x =element_text(angle =40, hjust =1),plot.title =element_text(face ="bold", size =10, hjust =0.5))#make plot interactiveggplotly(p)```:::::: {style="text-align: justify; font-size: 10pt;"}<details><summary>Code</summary>```{r}#replace "?" with NAdiabetic_data[diabetic_data =="?"] <-NA#replace "Unknown/Invalid" with NAdiabetic_data[diabetic_data =="Unknown/Invalid"] <-NA```</details>:::::: {style="text-align: justify; font-size: 10pt;"}The following features were removed for the dataset as each of these features has more than **40% missing values.** Additionally, these features are not not essential for building a reliable predictive model nor impact the classification task to predict diabetic patient readmission status within 30 days of discharge.- **weight** represents the patient's body weight. More than 97% of weight values are missing. Due to its lack of completeness, it is not practical to include this feature.- **payer_code** refers to the patient's method of payment or insurance with approximately 40% missing data. This feature is more administrative than predictive and has minimal relevance to the classification goal.- **medical_specialty** indicates the specialty of the admitting physician, with nearly 50% of missing values. 
Additionally, this feature contains a large number of inconsistent and sparse categories, which will not be beneficial during model building and training.Since the dataset is already large and only a reletvily small number of observations had missing values in important features like race, gender, and diagnosis codes, it was decided to remove those observations reduceding the dataset from 101,766 to 98,052 observations, keeping only complete records for analysis. Additionally, the features encounter_id (a unique ID for each hospital visit) and patient_nbr (a unique ID for each patient) were removed since these two features do not provide any useful information for predicting diabetic patient readmission status.:::::: {style="text-align: justify; font-size: 10pt;"}<details><summary>Code</summary>```{r}#remove the weight, payer_code, medical_specialty columndiabetic_data <- diabetic_data %>%select(-c(weight, payer_code, medical_specialty))dim(diabetic_data)``````{r}#remove rows that contain any NA valuesdiabetic_data <-na.omit(diabetic_data)dim(diabetic_data)``````{r}#remove the encounter ID and patient number columnsdiabetic_data <- diabetic_data %>%select(-c(encounter_id, patient_nbr ))dim(diabetic_data)```</details>:::##### **Features Engineering for Categorical Features**::: {style="text-align: justify; font-size: 10pt;"}The number of unique values in each column are counted to identify features with high cardinality. Most medication-related features had only 2–4 unique values, making them straightforward to include in the model. However, the diagnosis features (diag_1, diag_2, and diag_3) contained hundreds of unique ICD-9 codes, the official system of assigning codes to diagnoses associated with hospital in the US [@strack2014a]. Thus, using the diagnosis codes directly would be complex and may lead to overfitting. 
As these codes are highly uninterpretable, they were **mapped** into broader disease groups, following guidelines from the reference documentation [@strack2014a]. This transformation preserves the medical meaning while reducing the number of categories. For example, codes between 390–459 and 785 were mapped to "Circulatory", while codes from 460–519 and 786 were grouped under "Respiratory". Some codes such as "V45" did not fall into any of the defined ICD-9 groupings, thus were assigned the group "Unknown" to preserve them for further review.:::::: {style="text-align: justify; font-size: 10pt;"}<details><summary>Code</summary>```{r}for (col_name incolnames(diabetic_data)) {print(col_name)print(length(unique(diabetic_data[[col_name]])))}``````{r}# === diag_1 ===diabetic_data$diag_1_prefix <-substr(diabetic_data$diag_1, 1, 3)diabetic_data$diag_1_num <-as.numeric(diabetic_data$diag_1_prefix)diabetic_data$diag_1_group <-NAdiabetic_data$diag_1_group[diabetic_data$diag_1_num >=390& diabetic_data$diag_1_num <=459| diabetic_data$diag_1_num ==785] <-"Circulatory"diabetic_data$diag_1_group[diabetic_data$diag_1_num >=460& diabetic_data$diag_1_num <=519| diabetic_data$diag_1_num ==786] <-"Respiratory"diabetic_data$diag_1_group[diabetic_data$diag_1_num >=520& diabetic_data$diag_1_num <=579| diabetic_data$diag_1_num ==787] <-"Digestive"diabetic_data$diag_1_group[diabetic_data$diag_1_num >=250& diabetic_data$diag_1_num <251] <-"Diabetes"diabetic_data$diag_1_group[diabetic_data$diag_1_num >=800& diabetic_data$diag_1_num <=999] <-"Injury"diabetic_data$diag_1_group[diabetic_data$diag_1_num >=710& diabetic_data$diag_1_num <=739] <-"Musculoskeletal"diabetic_data$diag_1_group[diabetic_data$diag_1_num >=580& diabetic_data$diag_1_num <=629| diabetic_data$diag_1_num ==788] <-"Genitourinary"diabetic_data$diag_1_group[diabetic_data$diag_1_num >=140& diabetic_data$diag_1_num <=239] <-"Neoplasms"diabetic_data$diag_1_group[diabetic_data$diag_1_num >=1& diabetic_data$diag_1_num <=139] 
<-"Infectious"diabetic_data$diag_1_group[diabetic_data$diag_1_num >=290& diabetic_data$diag_1_num <=319] <-"Mental Disorders"diabetic_data$diag_1_group[diabetic_data$diag_1_num %in%c(780, 781, 784) | (diabetic_data$diag_1_num >=790& diabetic_data$diag_1_num <=799)] <-"Other"diabetic_data$diag_1_group[is.na(diabetic_data$diag_1_group)] <-"Unknown"#mark any unmatched code as unknown# === diag_2 ===diabetic_data$diag_2_prefix <-substr(diabetic_data$diag_2, 1, 3)diabetic_data$diag_2_num <-as.numeric(diabetic_data$diag_2_prefix)diabetic_data$diag_2_group <-NAdiabetic_data$diag_2_group[diabetic_data$diag_2_num >=390& diabetic_data$diag_2_num <=459| diabetic_data$diag_2_num ==785] <-"Circulatory"diabetic_data$diag_2_group[diabetic_data$diag_2_num >=460& diabetic_data$diag_2_num <=519| diabetic_data$diag_2_num ==786] <-"Respiratory"diabetic_data$diag_2_group[diabetic_data$diag_2_num >=520& diabetic_data$diag_2_num <=579| diabetic_data$diag_2_num ==787] <-"Digestive"diabetic_data$diag_2_group[diabetic_data$diag_2_num >=250& diabetic_data$diag_2_num <251] <-"Diabetes"diabetic_data$diag_2_group[diabetic_data$diag_2_num >=800& diabetic_data$diag_2_num <=999] <-"Injury"diabetic_data$diag_2_group[diabetic_data$diag_2_num >=710& diabetic_data$diag_2_num <=739] <-"Musculoskeletal"diabetic_data$diag_2_group[diabetic_data$diag_2_num >=580& diabetic_data$diag_2_num <=629| diabetic_data$diag_2_num ==788] <-"Genitourinary"diabetic_data$diag_2_group[diabetic_data$diag_2_num >=140& diabetic_data$diag_2_num <=239] <-"Neoplasms"diabetic_data$diag_2_group[diabetic_data$diag_2_num >=1& diabetic_data$diag_2_num <=139] <-"Infectious"diabetic_data$diag_2_group[diabetic_data$diag_2_num >=290& diabetic_data$diag_2_num <=319] <-"Mental Disorders"diabetic_data$diag_2_group[diabetic_data$diag_2_num %in%c(780, 781, 784) | (diabetic_data$diag_2_num >=790& diabetic_data$diag_2_num <=799)] <-"Other"diabetic_data$diag_2_group[is.na(diabetic_data$diag_2_group)] <-"Unknown"# === diag_3 
===diabetic_data$diag_3_prefix <-substr(diabetic_data$diag_3, 1, 3)diabetic_data$diag_3_num <-as.numeric(diabetic_data$diag_3_prefix)diabetic_data$diag_3_group <-NAdiabetic_data$diag_3_group[diabetic_data$diag_3_num >=390& diabetic_data$diag_3_num <=459| diabetic_data$diag_3_num ==785] <-"Circulatory"diabetic_data$diag_3_group[diabetic_data$diag_3_num >=460& diabetic_data$diag_3_num <=519| diabetic_data$diag_3_num ==786] <-"Respiratory"diabetic_data$diag_3_group[diabetic_data$diag_3_num >=520& diabetic_data$diag_3_num <=579| diabetic_data$diag_3_num ==787] <-"Digestive"diabetic_data$diag_3_group[diabetic_data$diag_3_num >=250& diabetic_data$diag_3_num <251] <-"Diabetes"diabetic_data$diag_3_group[diabetic_data$diag_3_num >=800& diabetic_data$diag_3_num <=999] <-"Injury"diabetic_data$diag_3_group[diabetic_data$diag_3_num >=710& diabetic_data$diag_3_num <=739] <-"Musculoskeletal"diabetic_data$diag_3_group[diabetic_data$diag_3_num >=580& diabetic_data$diag_3_num <=629| diabetic_data$diag_3_num ==788] <-"Genitourinary"diabetic_data$diag_3_group[diabetic_data$diag_3_num >=140& diabetic_data$diag_3_num <=239] <-"Neoplasms"diabetic_data$diag_3_group[diabetic_data$diag_3_num >=1& diabetic_data$diag_3_num <=139] <-"Infectious"diabetic_data$diag_3_group[diabetic_data$diag_3_num >=290& diabetic_data$diag_3_num <=319] <-"Mental Disorders"diabetic_data$diag_3_group[diabetic_data$diag_3_num %in%c(780, 781, 784) | (diabetic_data$diag_3_num >=790& diabetic_data$diag_3_num <=799)] <-"Other"diabetic_data$diag_3_group[is.na(diabetic_data$diag_3_group)] <-"Unknown"dim(diabetic_data)```</details>:::::: {style="text-align: justify; font-size: 10pt;"}The original diagnosis featuers diag_1, diag_2, and diag_3 were removed after new diagnosis group feature was engineered. This transformation helps reduce high cardinality and potential noise, making the data more suitable for building models. 
The intermediate columns used during the transformation (such as prefixes and numeric conversions) were also dropped to keep the dataset clean and focused on meaningful features. The diagnosis values that could not be mapped to ICD-9 categories and were labeled as "Unknown", lacked clinical meaning and were thus removed from the analysis. Removing these observations helps ensure cleaner inputs for modeling and reduces the risk of introducing noise or ambiguity during model training. As a result, the dataset was reduced from 98,052 to 56,673 observations.:::::: {style="text-align: justify; font-size: 10pt;"}<details><summary>Code</summary>```{r}#drop original diagnosis columns: diag_1, diag_2, diag_3diabetic_data <- diabetic_data %>%select(-c(diag_1, diag_2, diag_3))#drop helper columns used for processingdiabetic_data <- diabetic_data %>%select(-c(diag_1_prefix, diag_2_prefix, diag_3_prefix))diabetic_data <- diabetic_data %>%select(-c(diag_1_num, diag_2_num, diag_3_num))dim(diabetic_data)``````{r}diabetic_data$diag_1_group[diabetic_data$diag_1_group =="Unknown"] <-NAdiabetic_data$diag_2_group[diabetic_data$diag_2_group =="Unknown"] <-NAdiabetic_data$diag_3_group[diabetic_data$diag_3_group =="Unknown"] <-NAdiabetic_data <-na.omit(diabetic_data)#check new dimensionsdim(diabetic_data)```</details>:::::: {style="text-align: justify; font-size: 10pt;"}Based on the summary of unique values and a review of the dataset, two features were further dropped: discharge_disposition_id (26 unique values) and admission_source_id (15 unique values). Although these columns are stored as numeric values, they actually represent nominal categories (e.g., source of admission, discharge status) . Keeping these features would require encoding them into many sub-features using one-hot encoding. This could increase dimensionality and make the model more complex, especially since we don’t have labels explaining what each ID value represents. 
More importantly, using them directly in models (especially linear models or distance-based models like KNN) can mislead the model into treating the IDs as magnitudes. For example, it may treat ID 4 as twice ID 2, which is meaningless for nominal codes. Furthermore, all character-type features were converted to factors so that categorical variables are properly recognized and interpreted by statistical models.
:::

::: {style="text-align: justify; font-size: 10pt;"}
<details><summary>Code</summary>

```{r}
diabetic_data <- diabetic_data %>%
  select(-c(discharge_disposition_id, admission_source_id))

dim(diabetic_data)
```

```{r}
#convert all character columns to factors
for (col in names(diabetic_data)) {
  if (is.character(diabetic_data[[col]])) {
    diabetic_data[[col]] <- as.factor(diabetic_data[[col]])
  }
}

#show which columns are now factors
str(diabetic_data)
```

</details>
:::

##### **Identify and Handle Outliers**

::: {style="text-align: justify; font-size: 10pt;"}
To further clean and prepare the dataset for modeling, outliers were identified and appropriately handled to reduce their impact on model training and performance and to improve the robustness of predictive results. The numerical features were examined for potential outliers using the Interquartile Range (IQR) method, which counts the outliers in each numeric feature and stores the results in a data frame for further analysis and visualization. The numerical features were first identified using the `is.numeric()` function. Each numeric feature is then processed with the IQR rule: calculating the first quartile $Q_1$, the third quartile $Q_3$, and the lower and upper bounds, defined as $Q_1 - 1.5 \times IQR$ and $Q_3 + 1.5 \times IQR$, respectively. The boxplots below provide a comparative view of the numerical features with outliers identified by the IQR method.
Features such as num_lab_procedures, num_medications, and the visit counts (number_emergency, number_inpatient, number_outpatient) exhibit clearly right-skewed distributions with a large number of high-end outliers. These features may need special treatment before normalization or modeling to avoid biasing the results.
:::

::: {style="text-align: justify; font-size: 10pt;"}
<details><summary>Code</summary>

```{r}
#identify numeric columns
numeric_cols <- diabetic_data %>%
  select(where(is.numeric))

#loop over numeric columns and calculate outliers using IQR
outlier_summary <- numeric_cols %>%
  map_df(~{
    Q1 <- quantile(.x, 0.25)
    Q3 <- quantile(.x, 0.75)
    IQR <- Q3 - Q1
    lower_bound <- Q1 - 1.5 * IQR
    upper_bound <- Q3 + 1.5 * IQR
    outlier_count <- sum(.x < lower_bound | .x > upper_bound)
    tibble(Outlier_Count = outlier_count)
  }, .id = "Feature") %>%
  arrange(desc(Outlier_Count))
```

```{r}
#display table using kable
kable(outlier_summary, caption = "Numerical Features Outliers (Using IQR)") %>%
  kable_styling(full_width = FALSE, position = "center")
```

</details>

```{r, fig.align='center'}
#reshape to long format for ggplot
diabetic_data_long <- diabetic_data[, outlier_summary$Feature] %>%
  mutate(row_id = row_number()) %>%
  pivot_longer(cols = -row_id, names_to = "Feature", values_to = "Value")

#plot boxplot using ggplot
p <- ggplot(diabetic_data_long, aes(x = Feature, y = Value, color = Feature)) +
  geom_boxplot() +
  theme_minimal(base_size = 10) +
  theme(
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(angle = 40),
    plot.title = element_text(face = "bold", size = 10, hjust = 0.3),
    legend.position = "none"
  ) +
  scale_y_continuous(expand = expansion(mult = c(0.05, 0.1))) +
  labs(
    title = "Numeric Features with Outliers",
    x = "Feature",
    y = "Value"
  )

#make plot interactive
ggplotly(p)
```
:::

::: {style="text-align: justify; font-size: 10pt;"}
To handle outliers, we iterated over all numeric variables and marked observations containing an outlier for easier identification.
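A common alternative treatment for such right-skewed count features is a log transform, which compresses the long right tail instead of dropping rows. The sketch below uses hypothetical counts; this project removes IQR outliers instead:

```r
#hypothetical emergency-visit-style counts with one extreme value
visits <- c(0, 0, 1, 1, 2, 3, 5, 40)

#log1p() compresses the long right tail while keeping zeros at zero
log_visits <- log1p(visits)

range(visits)      #raw range is dominated by the extreme value
range(log_visits)  #compressed range after log1p
```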
Since normalization (such as Min-Max scaling) is highly sensitive to extreme values, outliers in the dataset would distort the scaling. If we normalized before removing outliers, the minimum and maximum would be skewed and most of the data would be squeezed into a narrow range near 0, making it harder for the model to learn from the majority of the data. Thus, observations with outlier values were removed, resulting in a dataset with 39,304 observations and 43 features.
:::

::: {style="text-align: justify; font-size: 10pt;"}
<details><summary>Code</summary>

```{r}
#create a logical index of rows to keep (initialize with TRUEs)
keep_rows <- rep(TRUE, nrow(diabetic_data))

#loop over numeric columns and mark rows with outliers
for (col in names(numeric_cols)) {
  values <- diabetic_data[[col]]
  Q1 <- quantile(values, 0.25)
  Q3 <- quantile(values, 0.75)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR

  #identify outliers for this feature
  outlier_mask <- values < lower_bound | values > upper_bound

  #mark FALSE for rows that are outliers in this column
  keep_rows <- keep_rows & !outlier_mask
}

#apply the filter once
diabetic_data <- diabetic_data[keep_rows, ]

#check new dimensions
dim(diabetic_data)
```

</details>
:::

##### **Normalization**

::: {style="text-align: justify; font-size: 10pt;"}
Min-Max normalization is applied to selected numeric features because they have different scales and large ranges, which could unfairly influence models that are sensitive to feature magnitude, such as KNN or logistic regression. Based on the output of the summary() function, the following six features had relatively high maximum values or wide ranges: num_lab_procedures, num_medications, number_outpatient, number_emergency, number_inpatient, and number_diagnoses. These features were normalized using Min-Max scaling, which transforms their values to fall between 0 and 1.
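The earlier point about scaling only after outlier removal can be illustrated with a toy vector (hypothetical values, not the project's data): a single extreme value squeezes everything else toward 0.

```r
#Min-Max scaling as used in this report
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

#with an extreme value, the bulk of the data lands in a sliver near 0
with_outlier <- round(normalize(c(1, 2, 3, 4, 5, 1000)), 3)
with_outlier

#after dropping the extreme value, the points spread across [0, 1]
without_outlier <- normalize(c(1, 2, 3, 4, 5))
without_outlier
```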
This ensures that no single feature dominates the learning process simply because it has larger numeric values. Features with small ranges or categorical meaning (such as admission_type_id) were left unchanged, as normalization is neither meaningful nor necessary in those cases.
:::

::: {style="text-align: justify; font-size: 10pt;"}
<details><summary>Code</summary>

```{r}
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

#apply normalization to selected features
diabetic_data$num_lab_procedures <- normalize(diabetic_data$num_lab_procedures)
diabetic_data$num_medications <- normalize(diabetic_data$num_medications)
diabetic_data$number_outpatient <- normalize(diabetic_data$number_outpatient)
diabetic_data$number_emergency <- normalize(diabetic_data$number_emergency)
diabetic_data$number_inpatient <- normalize(diabetic_data$number_inpatient)
diabetic_data$number_diagnoses <- normalize(diabetic_data$number_diagnoses)
```

</details>
:::

### 4. Explore and Visualize

::: {style="text-align: justify; font-size: 10pt;"}
Exploratory Data Analysis (EDA) was conducted on the cleaned dataset to gain comprehensive insights into the data, uncover patterns, and identify modeling strategies. The analysis focuses on the target variable readmitted and the features potentially associated with readmission status, supporting the classification goal of predicting diabetic patient readmission within 30 days of discharge.
:::

##### **Target Variable Distribution**

::: {style="text-align: justify; font-size: 10pt;"}
The bar chart below visualizes the distribution of the target variable readmitted. More than 50% of patients were not readmitted, around 30% were readmitted after 30 days, and only 10% were readmitted within 30 days. The chart reveals a significant class imbalance, with the high-risk class of readmission within 30 days (\<30) being the minority class.
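One simple way to counteract such an imbalance during training is inverse-frequency class weighting. The sketch below uses hypothetical counts that roughly match the proportions above, not the exact dataset tallies:

```r
#hypothetical class counts mirroring the observed proportions
class_counts <- c(NO = 5500, `>30` = 3000, `<30` = 1000)

#weight = n_total / (n_classes * n_class): rarer classes weigh more
class_weights <- sum(class_counts) / (length(class_counts) * class_counts)
round(class_weights, 2)  #the minority "<30" class receives the largest weight
```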
This class imbalance highlights the need for class-weight adjustment during modeling to improve classifier sensitivity.
:::

::: {style="text-align: justify; font-size: 10pt;"}
<details><summary>Code</summary>

```{r}
#summary table with three columns, renamed as follows
readmit_dist <- diabetic_data %>%
  dplyr::count(readmitted, name = "Total_Patients") %>%
  dplyr::mutate(
    Proportion = Total_Patients / sum(Total_Patients),
    Percentage = percent(Proportion, accuracy = 1)
  ) %>%
  rename(`Readmission Category` = readmitted) %>%
  select(`Readmission Category`, Total_Patients, Percentage)
```

</details>

```{r, fig.align='center', fig.width=6, fig.height=4}
#plot readmission class distribution
p <- ggplot(readmit_dist, aes(x = reorder(`Readmission Category`, -Total_Patients), y = Total_Patients, fill = `Readmission Category`)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(
    aes(label = paste0(Percentage, "\n(", comma(Total_Patients), ")")),
    vjust = -0.3, size = 2, fontface = "bold"
  ) +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.1))) +
  labs(
    title = "Distribution of Readmission Status",
    x = "Readmission Category",
    y = "Number of Patients"
  ) +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold", size = 10, hjust = 0.5))

#make plot interactive
ggplotly(p)
```
:::

##### **Patient Demographic Composition**

::: {style="text-align: justify; font-size: 10pt;"}
Patient demographic features such as gender, age, and race were analyzed to uncover readmission patterns and understand where the bulk of readmissions comes from. The figure below shows that readmissions are relatively balanced between males and females, with both genders following similar patterns, indicating no major impact.
Looking at the age groups, the figure shows that most patients are between 50 and 80 years old, where the highest readmission counts occur, indicating that age possibly influences readmission status. The higher counts among Caucasian patients reflect population volume rather than higher readmission risk, as Caucasian patients make up the largest racial group in the dataset; therefore, a proportion analysis was also conducted to provide a more accurate comparison.
:::

::: {style="text-align: justify; font-size: 10pt;"}

```{r, fig.align='center'}
#gender plot
p1 <- ggplot(diabetic_data, aes(x = gender, fill = readmitted)) +
  geom_bar(position = "dodge") +
  labs(x = "Gender", fill = "Readmitted") +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  theme_minimal(base_size = 10) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.2),
    legend.position = "right"
  )

#age plot
p2 <- ggplot(diabetic_data, aes(x = age, fill = readmitted)) +
  geom_bar(position = "dodge") +
  labs(x = "Age", fill = "Readmitted") +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  theme_minimal(base_size = 10) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.2),
    axis.text.x = element_text(angle = 40, hjust = 0.2)
  )

#race plot
p3 <- ggplot(diabetic_data, aes(x = race, fill = readmitted)) +
  geom_bar(position = "dodge") +
  labs(x = "Race", fill = "Readmitted") +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  theme_minimal(base_size = 10) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.2),
    axis.text.x = element_text(angle = 40, hjust = 0.2)
  )

#combine plots horizontally and make plot interactive
subplot_combined <- subplot(
  ggplotly(p1),
  ggplotly(p2) %>% layout(showlegend = FALSE),
  ggplotly(p3) %>% layout(showlegend = FALSE),
  nrows = 1, shareX = FALSE, shareY = FALSE, titleX = TRUE, titleY = FALSE
) %>%
  layout(
    title = list(
      text = "Patient Readmissions by Gender, Age, and Race",
      x = 0.5, xanchor = "center",
      font = list(size = 16, family = "Arial", color = "#333333")
    ),
    annotations = list(
      list(
        text = "Number of Patients",
        x = 0, xref = "paper", y = 0.5, yref = "paper",
        showarrow = FALSE, font = list(size = 12), textangle = -90
      )
    )
  )

#display the final result
subplot_combined
```
:::

##### **Patient Profile Comparison**

::: {style="text-align: justify; font-size: 10pt;"}
The radar chart below provides a comparative overview of patient profiles across the three readmission classes. Patients who were readmitted within 30 days exhibit higher values across most variables, including number of medications, length of hospital stay, emergency room visits, and inpatient encounters, suggesting a more complicated medical profile. In contrast, patients who were not readmitted generally show lower values across most variables, with a slight exception in the number of procedures, which may reflect planned preventive care or early interventions rather than medical complications.
:::

::: {style="text-align: justify; font-size: 10pt;"}

```{r, fig.align='center'}
#get the average profile for each readmission group
radar_data <- diabetic_data %>%
  group_by(readmitted) %>%
  summarise(
    Medications = mean(num_medications, na.rm = TRUE),
    Procedures = mean(num_procedures, na.rm = TRUE),
    TimeInHospital = mean(time_in_hospital, na.rm = TRUE),
    ERVisits = mean(number_emergency, na.rm = TRUE),
    InpatientVisits = mean(number_inpatient, na.rm = TRUE),
    OutpatientVisits = mean(number_outpatient, na.rm = TRUE),
    .groups = "drop"
  )

#fmsb needs the first two rows to define the range (max + min) of the axes
radar_chart <- rbind(
  apply(radar_data[,-1], 2, max),
  apply(radar_data[,-1], 2, min),
  radar_data[,-1]
)

#convert to numeric
radar_chart <- as.data.frame(lapply(radar_chart, as.numeric))
rownames(radar_chart) <- c("Max", "Min", radar_data$readmitted)

#set custom color-blind friendly colors
custom_colors <- c("#21908C", "#440154",
"#5DC863")
colors_fill <- scales::alpha(custom_colors, 0.3)

#plot radar chart
radarchart(
  radar_chart,
  axistype = 1,
  pcol = custom_colors,
  pfcol = colors_fill,
  plwd = 2,
  plty = 1,
  cglcol = "grey",
  cglty = 1,
  axislabcol = "grey30",
  vlcex = 0.85,
  title = "Patient Profile Comparison by Readmission Status"
)

#add a legend to keep it readable
legend("topright",
       legend = radar_data$readmitted,
       bty = "n", pch = 20, col = custom_colors,
       text.col = "black", cex = 0.9)
```
:::

##### **Diagnosis-Based Readmission Patterns**

::: {style="text-align: justify; font-size: 10pt;"}
The feature engineering carried out on the diagnosis codes in the clean-and-prepare phase enabled an interpretable analysis of how diagnosis categories relate to patient readmission status. The chart shows the distribution of diagnosis categories across the three diagnosis levels (diag_1, diag_2, and diag_3), grouped by readmission status. Circulatory and Other conditions are the most frequent across all diagnosis levels, especially in the primary diagnosis (diag_1). In contrast, conditions like Diabetes and Neoplasms are more frequently recorded as secondary or tertiary issues, suggesting their significant impact on patient readmission status.
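Because group sizes differ widely, raw counts can be complemented by within-group proportions. The sketch below uses a tiny hypothetical tibble (not the real dataset) to show the shape of such a proportion table:

```r
library(dplyr)

#toy rows standing in for the real data: compare the share of each
#readmission status inside every diagnosis group, not raw counts
toy <- tibble(
  diag_1_group = c("Circulatory", "Circulatory", "Circulatory", "Diabetes", "Diabetes"),
  readmitted   = c("NO", "<30", ">30", "<30", "NO")
)

diag_props <- toy %>%
  count(diag_1_group, readmitted) %>%
  group_by(diag_1_group) %>%
  mutate(prop = n / sum(n)) %>%   #proportions sum to 1 within each group
  ungroup()
diag_props
```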
Overall, this visualization provides insight into the underlying medical conditions that are possibly associated with hospital readmission, uncovering readmission patterns.
:::

::: {style="text-align: justify; font-size: 10pt;"}

```{r, fig.align='center'}
#combine diagnosis group variables for plotting
diag_long <- diabetic_data %>%
  select(readmitted, diag_1_group, diag_2_group, diag_3_group) %>%
  pivot_longer(cols = starts_with("diag_"), names_to = "Diagnosis_Level", values_to = "Diagnosis_Group")

#clean label names
diag_long$Diagnosis_Level <- recode(diag_long$Diagnosis_Level,
                                    diag_1_group = "Diagnosis 1",
                                    diag_2_group = "Diagnosis 2",
                                    diag_3_group = "Diagnosis 3")

#plot bar charts
p <- ggplot(diag_long, aes(x = fct_infreq(Diagnosis_Group), fill = readmitted)) +
  geom_bar(position = "dodge") +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  labs(
    title = "Readmission Count by Diagnosis Level (diag_1, diag_2, diag_3)",
    x = "Diagnosis Group",
    y = "Number of Patients",
    fill = "Readmitted"
  ) +
  facet_wrap(~ Diagnosis_Level, ncol = 1, scales = "free_x") +
  theme_minimal(base_size = 10) +
  theme(
    axis.text.x = element_text(angle = 35, hjust = 1, face = "bold"),
    strip.text = element_text(face = "bold"),
    legend.position = "right"
  )

ggplotly(p, height = 600)
```
:::

##### **Correlation Between Numeric Variables**

::: {style="text-align: justify; font-size: 10pt;"}
The heatmap below presents the correlation matrix for the numeric features in the dataset, offering a snapshot of how these features relate to one another. Overall, most numeric features are only weakly to moderately correlated, indicating that each contributes distinct information rather than redundant signal.
Key observations include:

- A moderate positive correlation between time_in_hospital and both num_lab_procedures (0.33) and num_medications (0.43), indicating longer hospital stays are associated with more procedures and medications.
- A low positive correlation between most other feature pairs, such as number_inpatient and num_procedures, indicating their independent value.

The correlation matrix supported including the dataset's numeric features in classifier modeling, as they appear to contribute unique information toward the underlying classification goal of predicting diabetic patient readmission status within 30 days of discharge.
:::

::: {style="text-align: justify; font-size: 10pt;"}

```{r, fig.align='center', fig.width=6, fig.height=4}
#identify numeric columns
numeric_vars <- diabetic_data[sapply(diabetic_data, is.numeric)]
numeric_vars <- numeric_vars[, colSums(!is.na(numeric_vars)) > 0]

#prepare correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")
cor_df <- melt(cor_matrix)

#base heatmap with better visual harmony
p <- ggplot(cor_df, aes(x = Var2, y = Var1, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)), color = "black", size = 3.5) +
  scale_fill_gradient2(
    low = "#440154",  # purple for negative
    mid = "white",    # neutral
    high = "#21908C", # teal for positive
    midpoint = 0,
    limits = c(-1, 1),
    name = "Correlation"
  ) +
  labs(
    title = "Correlation Between Patient Numeric Features",
    x = NULL,
    y = NULL
  ) +
  theme_minimal(base_size = 10) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, face = "bold"),
    axis.text.y = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )

ggplotly(p)
```
:::

##### **EDA Insights and Summary**

::: {style="text-align: justify; font-size: 10pt;"}
The EDA provided valuable insights into the factors most likely to influence hospital readmission among diabetic patients.
Though some findings confirmed our initial expectations of the underlying dataset, others revealed more insightful patterns.

- The target variable readmitted revealed that the majority of patients were not readmitted or were readmitted after 30 days. A smaller, yet medically significant, subset was readmitted within 30 days, confirming a class imbalance that will be accounted for during the modeling stage.
- Demographics such as gender, race, and age showed some variation. Older age groups, specifically 60-80, tend to dominate readmissions, which matches chronic conditions such as diabetes.
- The diagnostic groups helped tremendously in narrowing down the most impactful features. Circulatory and Respiratory diagnoses appear more frequently and consequently show higher readmission counts. On the other hand, conditions such as Neoplasms (cancer) surprisingly showed low readmission, likely because follow-ups are handled in outpatient settings or specialized clinics.
- The correlation heatmap confirmed low correlation among numerical features, indicating low redundancy, which is ideal for building a classification model as each feature contributes different information.

These findings established a solid foundation for the evaluation plan and model selection.
:::

### 5. Evaluation Plan

::: {style="text-align: justify; font-size: 10pt;"}
As the exploratory data analysis of the target variable revealed a clear class imbalance, with the "NO" readmission class accounting for approximately 50% of the entire dataset, evaluation metrics that go beyond accuracy alone are crucial for properly evaluating model performance. To provide a more comprehensive assessment across all classes, the selected evaluation metrics are:

- **Accuracy:** Measures the overall correctness of predictions across all samples, i.e., the proportion of correct predictions.
Accuracy is selected as a baseline comparison measure since it is a quick indicator of overall model performance, but it can be misleading when classes are imbalanced.
- **Precision:** Measures the proportion of predicted positives that were actually correct, computed for each class `(Precision_NO, Precision<30, Precision>30)`. It evaluates how accurate the model's predictions are when it predicts a particular class and is important when false positives are costly; it is selected because flagging a low-risk patient as high-risk is costly for this classification task.
- **Recall (Sensitivity):** Measures the model's ability to detect positive cases, computed for each class `(Recall_NO, Recall<30, Recall>30)`. It assesses the model's ability to correctly identify actual instances within each class and is chosen because it helps evaluate how well the model distinguishes between readmission timeframes, which is crucial for clinical decision-making and targeted interventions.
- **F1 Score:** The harmonic mean of precision and recall for each class, capturing the balance between the two. In imbalanced datasets, F1 is more informative than accuracy because it balances the precision-recall trade-off, giving a more realistic measure of performance on the minority class.
- **Macro F1 Score:** The arithmetic mean of the F1 scores across all three classes, giving equal weight to each class and offering a fair assessment of overall performance.
Macro F1 Score treats all classes equally, making it a proper performance measure for an imbalanced dataset and giving insight into whether the model is biased toward the majority class.

Given that this task is a typical multi-class classification problem with an imbalanced class distribution (e.g., NO cases are significantly more common than \<30 cases), relying solely on Accuracy is insufficient for a comprehensive evaluation. Therefore, we emphasize per-class Precision, Recall, and F1 scores, which are particularly important for assessing how well the model performs on the high-risk class of early readmission (\<30). Additionally, the Macro F1 Score helps mitigate the dominance of the majority class and provides a more representative evaluation of model performance.
:::

### 6. Classification Models

::: {style="text-align: justify; font-size: 10pt;"}
Five classification models are selected for fitting and evaluating the dataset, each suited to the characteristics and classification goals of the "Diabetes 130-US Hospitals for Years 1999-2008" dataset:

1. **Decision Tree:** Decision Trees are well-suited to our diabetes dataset with mixed-type variables and do not require standardization. The model offers high interpretability by allowing traceability of decision paths. It also supports class weighting, which can help improve detection of underrepresented classes such as \<30. Evaluation will focus on class-wise recall and macro-averaged F1 score to assess balanced performance.
2. **Random Forest:** Random Forest aggregates multiple decision trees to reduce overfitting and better capture complex feature interactions (e.g., number_inpatient, num_lab_procedures). It is robust to feature transformations, handles class imbalance effectively, and provides variable importance rankings. Model evaluation includes macro F1 and precision-recall to capture both overall and minority-class performance.
3. **Support Vector Machine (SVM):** SVM is suitable for our high-dimensional dataset and can learn non-linear decision boundaries via kernel functions. With class weighting enabled (`class_weight='balanced'`), SVM can better manage imbalance, particularly for the \<30 class. Performance will be evaluated using precision, recall, F1, macro F1, and AUC, with emphasis on the minority class.
4. **k-Nearest Neighbors (k-NN):** k-NN is a distance-based, non-parametric method that relies on similarity across the feature space. As it is sensitive to feature scaling, standardization was applied. While k-NN serves as a strong baseline, it may favor majority classes; therefore, it is used primarily for benchmarking, with evaluation based on per-class recall, precision, and F1, macro F1, weighted F1 score, and MCC.
5. **Logistic Regression:** Logistic regression is appropriate for multi-class classification tasks with interpretable, probability-based outputs. It also supports class weighting, which helps address class imbalance. Evaluation metrics include accuracy, sensitivity, specificity, F1 score, macro F1, and weighted F1 score in both binary and multi-class settings.
:::

### 7. Innovation

::: {style="text-align: justify; font-size: 10pt;"}
The project team made a deliberate effort to go beyond the project's requirements and demonstrated innovation across the project plan and EDA. These innovations include: the selection of a challenging and novel dataset that meets all five dataset-selection guidelines; inclusive and accessible visuals (all plots use color-blind-friendly palettes and were made interactive with ggplotly to support dynamic exploration); domain-informed feature engineering (transformation of the diagnosis features using an ICD-9 code map); and the use of the kableExtra library to display table results intuitively.
:::

### 8. References